The fraud detection logs were provided by a multinational company that operates a mobile financial service currently running in more than 14 countries around the world.
Below is a sample row, followed by an explanation of each column:
1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0
step - maps a unit of real-world time; in this case 1 step is 1 hour. There are 744 steps in total (a 30-day simulation).
type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.
amount - amount of the transaction in local currency.
nameOrig - customer who started the transaction
oldbalanceOrg - initial balance before the transaction
newbalanceOrig - new balance after the transaction
nameDest - customer who is the recipient of the transaction
oldbalanceDest - initial balance of the recipient before the transaction. Note that there is no information for customers whose names start with M (merchants).
newbalanceDest - new balance of the recipient after the transaction. Note that there is no information for customers whose names start with M (merchants).
isFraud - marks transactions made by fraudulent agents inside the simulation. In this dataset, fraudulent agents aim to profit by taking control of customers' accounts, emptying the funds by transferring them to another account, and then cashing out of the system.
isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. In this dataset, an illegal attempt is an attempt to transfer more than 200,000 in a single transaction.
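To make the schema concrete, the sample row above can be parsed with pandas. This is a small sketch; the column names simply follow the descriptions above:

```python
from io import StringIO
import pandas as pd

columns = ["step", "type", "amount", "nameOrig", "oldbalanceOrg",
           "newbalanceOrig", "nameDest", "oldbalanceDest",
           "newbalanceDest", "isFraud", "isFlaggedFraud"]
sample = "1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0"

# Parse the single sample row using the column names above.
row = pd.read_csv(StringIO(sample), names=columns)
print(row.T)  # one transaction, shown column-by-column
```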
Importing the required libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import tree, metrics
from io import StringIO
from IPython.display import Image
import pydotplus
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
Loading the financial dataset:
filepath='C:\\Users\\61435\\Desktop\\SpringBoard_Assignments\\Capstone2-Fraud Detection\\Capstone2-Fraud-Detection\\Financial Datasets For Fraud Detection.csv'
df= pd.read_csv(filepath)
df.head()
Check the data type of each column:
df.info()
Check for missing data:
missing = pd.concat([df.isnull().sum(), 100 * df.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by='count',ascending=False)
Conclusion: no missing data found. Now analyse each feature, starting with the string (object) columns:
df.select_dtypes('object')
Now let's check the numeric features.
df.select_dtypes(include = ['int64','float64'])
Conduct EDA on the dataset to examine relationships between variables and other patterns in the data.
df.describe().T
df[df['isFraud']==1]
df[df['isFlaggedFraud']==1]
print(df.isFraud.value_counts())
sns.countplot(data=df, x='isFraud')
plt.ylabel('Count')
plt.show()
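The count plot makes the class imbalance visible; it can also be quantified directly with `value_counts(normalize=True)`. The snippet below is a sketch on synthetic labels standing in for `df['isFraud']`:

```python
import pandas as pd

# Synthetic stand-in for df['isFraud']: heavily imbalanced labels.
labels = pd.Series([0] * 9990 + [1] * 10)

# normalize=True returns the fraction of each class rather than raw counts.
share = labels.value_counts(normalize=True)
print(share)

fraud_rate = share.get(1, 0.0)
print(f"fraud rate: {fraud_rate:.2%}")
```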
Count the number of each transaction type:
print(df.type.value_counts())
df.type.value_counts().plot(kind='bar')
plt.show()
Check the relationship between isFraud and isFlaggedFraud:
pd.crosstab(df.isFraud,df.isFlaggedFraud)
df.groupby('type')[['isFraud','isFlaggedFraud']].sum()
Conclusion: fraud occurs in only two transaction types, TRANSFER and CASH_OUT. isFlaggedFraud is set only for TRANSFER transactions.
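Note the double brackets in the `groupby` selection above: a list of column names is required, since tuple indexing was removed in recent pandas. A minimal sketch on toy data (synthetic, not drawn from the real dataset) showing the pattern:

```python
import pandas as pd

# Toy transactions (synthetic values for illustration only).
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "TRANSFER", "DEBIT"],
    "isFraud": [0, 1, 1, 0, 0],
    "isFlaggedFraud": [0, 1, 0, 0, 0],
})

# Select a *list* of columns, then sum the 0/1 flags per transaction type.
summary = toy.groupby("type")[["isFraud", "isFlaggedFraud"]].sum()
print(summary)
```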
df.hist(figsize=(15,10))
plt.subplots_adjust(hspace=1)
Categorical variable: type. Check the categorical variable in the dataset:
#select categorical variable 'type'
df_cat = df.select_dtypes(include = 'object').copy()
#get counts of each variable value
df_cat.type.value_counts()
#count plot for one variable
sns.countplot(data = df_cat, x = 'type')
Create a boxplot for every column in df:
boxplot = df.boxplot(grid=True, vert=False,fontsize=20)
Conclusion: outliers are expected here, as accounts are not interlinked and the bank allows customers to place all types of transactions.
fraud = df[df['isFraud']==1]
nonfraud = df[df['isFraud']==0]
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
f.suptitle('Amount per transaction by class')
bins = 25
ax1.hist(fraud.amount, bins = bins)
ax1.set_title('Fraud')
ax2.hist(nonfraud.amount, bins = bins)
ax2.set_title('Non Fraud')
plt.xlabel('Amount ($)')
plt.ylabel('Number of Transactions')
plt.yscale('log')
plt.show();
Conclusion: there are far fewer fraudulent transactions than non-fraudulent ones across the range of amounts.
Create the correlation matrix heat map, showing the correlation coefficient for each pair of numeric variables:
plt.figure(figsize=(14,12))
sns.heatmap(df.corr(numeric_only=True),linewidths=.1,cmap="YlGnBu", annot=True)
plt.yticks(rotation=0);
Conclusion: oldbalanceOrg and newbalanceOrig are closely correlated, as are oldbalanceDest and newbalanceDest. isFraud is related to the amount column, so amount will be an important feature to work on. There is also a moderately strong negative correlation between newbalanceDest and amount.
Create pair plots:
#pair plots
sns.pairplot(df)
plt.show()
Apply PCA to the correlated features:
features_array = ['amount','oldbalanceOrg','newbalanceOrig','oldbalanceDest','newbalanceDest']
# Separating out the features
features= df.loc[:, features_array].values
print(features)
# Standardizing the features
standard_features = StandardScaler().fit_transform(features)
print("**************************")
print(standard_features)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
# Fit PCA on the standardized features computed above.
principalComponents = pca.fit_transform(standard_features)
principalDf = pd.DataFrame(data=principalComponents,
                           columns=['principal component 1', 'principal component 2'])
finalDf = pd.concat([principalDf, df[['isFraud']]], axis = 1)
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1)
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)
targets = [0,1]
colors = ['r', 'g']
for target, color in zip(targets, colors):
    indicesToKeep = finalDf['isFraud'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1'],
               finalDf.loc[indicesToKeep, 'principal component 2'],
               c=color, s=50)
ax.legend(targets)
ax.grid()
pca.explained_variance_ratio_
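The explained variance ratio indicates how much of the total variance each component captures; its cumulative sum is a common way to choose the number of components. A sketch on synthetic correlated features (mimicking the highly correlated balance columns, not the real data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Five nearly identical columns: one latent factor plus small noise.
base = rng.normal(size=(500, 1))
X = np.hstack([base + 0.05 * rng.normal(size=(500, 1)) for _ in range(5)])

X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

# Cumulative explained variance; pick the smallest count reaching 95%.
cum = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cum, 0.95) + 1)
print(cum, n_components)
```

Because the columns are almost perfectly correlated, a single component suffices here; on the real balance columns the same calculation shows how much redundancy PCA removes.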
df.head()
df.info()
From the above information we can see that type, nameOrig and nameDest are of object type.
df['type'].unique()
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
enc.fit(df['type'])
df['numeric_type'] = enc.transform(df['type'])
df.head()
df_new=df.drop(['type'],axis=1)
df_new.head()
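LabelEncoder maps the five transaction types to integers 0–4, which implicitly imposes an ordering on them. One-hot encoding via `pd.get_dummies` is a common alternative that avoids this; a sketch on toy data:

```python
import pandas as pd

# Toy frame standing in for df (synthetic values for illustration).
toy = pd.DataFrame({"type": ["PAYMENT", "TRANSFER", "CASH_OUT"],
                    "amount": [10.0, 20.0, 30.0]})

# Replace the 'type' column with one indicator column per category.
encoded = pd.get_dummies(toy, columns=["type"], prefix="type")
print(encoded.columns.tolist())
```

Tree-based models are largely insensitive to the arbitrary ordering, so the integer encoding used here is a reasonable choice; for linear models one-hot encoding is usually safer.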
print(f"% of transactions where difference between opening and closing balance of customer is not equal to transaction amount is {(1-len(df[np.abs(df.oldbalanceOrg-df.newbalanceOrig) == (df.amount)])/len(df))*100}")
print(f"% of transactions where difference between opening and closing balance of recipient is not equal to transaction amount is {(1-len(df[np.abs(df.oldbalanceDest-df.newbalanceDest) == (df.amount)])/len(df))*100}")
print(f"% of transactions where opening and closing balance of customer is equal to 0 but transaction amount is not equal to 0 is {(1-len(df[(df.oldbalanceOrg==0)&(df.newbalanceOrig==0)&(df.amount!=0)])/len(df))*100}")
print(f"% of transactions where opening and closing balance of recipient is equal to 0 but transaction amount is not equal to 0 is {(1-len(df[(df.oldbalanceDest==0)&(df.newbalanceDest==0)&(df.amount!=0)])/len(df))*100}")
print("% of transfer transactions where the opening and closing balance of the customer remained the same is " + str((len(df[(df.type=="CASH_OUT")&(df.oldbalanceOrg==df.newbalanceOrig)&(df.amount != 0)])/len(df))*100))
df_new[features_array].boxplot(figsize=(14,14),grid=False, rot=45, fontsize=15)
plt.show()
Outliers are expected for this dataset, as they are explainable.
Conclusion: the only feature that needed to be modified was 'type', as it is categorical in nature.
Separating features from target (defining X and y for modelling):
features=['amount','oldbalanceOrg','newbalanceOrig','oldbalanceDest','newbalanceDest','numeric_type','isFlaggedFraud']
X=df_new[features]
y=df_new['isFraud']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=246)
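With fraud this rare, a plain random split can leave a test set with disproportionately few positives. Passing `stratify=y` to `train_test_split` preserves the class proportions in both splits; a sketch on synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 98 negatives, 2 positives.
y_toy = np.array([0] * 98 + [1] * 2)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 2% positive rate in both halves.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=0, stratify=y_toy)
print(y_tr.sum(), y_te.sum())  # one positive in each split
```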
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=5).fit(X_train,y_train)
y_predLogistic=clf.predict(X_test)
pd.Series(y_predLogistic)
print(y_predLogistic)
clf
from sklearn.metrics import f1_score,classification_report,roc_auc_score
from sklearn.metrics import confusion_matrix
print(f"F1 score of Logistic Regression classifier is {f1_score(clf.predict(X_test),y_test)}")
print(f"AUC of Logistic Regression classifier is {roc_auc_score(clf.predict(X_test),y_test)}")
print('Precision score for "Yes"' , metrics.precision_score(y_test,y_predLogistic, pos_label =1))
print('Precision score for "No"' , metrics.precision_score(y_test,y_predLogistic, pos_label =0))
print('Recall score for "Yes"' , metrics.recall_score(y_test,y_predLogistic, pos_label =1))
print('Recall score for "No"' , metrics.recall_score(y_test,y_predLogistic, pos_label =0))
cf_matrix=confusion_matrix(y_test,y_predLogistic)
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')
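The annotation labels for the confusion-matrix heat map are rebuilt the same way for every model below, so they can be factored into a small helper. This is a sketch; `cf_labels` is a new name introduced here, not part of the original notebook:

```python
import numpy as np

def cf_labels(cf_matrix):
    """Build the 'name / count / percent' annotations for a 2x2
    confusion matrix, matching the heat maps in this notebook."""
    names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
    counts = [f"{v:0.0f}" for v in cf_matrix.flatten()]
    pcts = [f"{v:.2%}" for v in cf_matrix.flatten() / np.sum(cf_matrix)]
    labels = [f"{n}\n{c}\n{p}" for n, c, p in zip(names, counts, pcts)]
    return np.asarray(labels).reshape(2, 2)

labels = cf_labels(np.array([[50, 10], [5, 35]]))
print(labels)
```

With this helper, each plot reduces to `sns.heatmap(cf_matrix, annot=cf_labels(cf_matrix), fmt='', cmap='Blues')`.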
entr_model = tree.DecisionTreeClassifier(criterion="entropy", random_state = 42)
entr_model.fit(X_train,y_train)
y_pred=entr_model.predict(X_test)
pd.Series(y_pred)
print(y_pred)
entr_model
Now we want to visualize the tree:
dot_data = StringIO()
tree.export_graphviz(entr_model, out_file=dot_data,
filled=True, rounded=True,
special_characters=True, feature_names=X_train.columns,class_names = ["NO", "YES"])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
print(f"F1 score of entr_model classifier is {f1_score(entr_model.predict(X_test),y_test)}")
print(f"AUC of entr_model classifier is {roc_auc_score(entr_model.predict(X_test),y_test)}")
print('Precision score for "Yes"' , metrics.precision_score(y_test,y_pred, pos_label =1))
print('Precision score for "No"' , metrics.precision_score(y_test,y_pred, pos_label =0))
print('Recall score for "Yes"' , metrics.recall_score(y_test,y_pred, pos_label =1))
print('Recall score for "No"' , metrics.recall_score(y_test,y_pred, pos_label =0))
cf_matrix=confusion_matrix(y_test,y_pred)
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')
This model is overfitting; let's limit the maximum depth to get a more appropriate model.
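Overfitting shows up as a large gap between training and test accuracy: an unrestricted tree memorizes the training set, while a depth-limited one generalizes better. A sketch on synthetic noisy data (not the fraud dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise, so perfect test accuracy is impossible.
X, y = make_classification(n_samples=2000, n_informative=5,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# The unrestricted tree fits the training data (noise included) perfectly,
# but its test score is far lower; the shallow tree's gap is much smaller.
for name, model in [("unlimited depth", deep), ("max_depth=3", shallow)]:
    print(name, model.score(X_tr, y_tr), model.score(X_te, y_te))
```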
gini_model = tree.DecisionTreeClassifier(criterion='gini', random_state = 1234, max_depth = 3)
gini_model.fit(X_train, y_train)
y_pred2 = gini_model.predict(X_test)
y_pred2 = pd.Series(y_pred2)
gini_model
dot_data = StringIO()
tree.export_graphviz(gini_model, out_file=dot_data,
filled=True, rounded=True,
special_characters=True, feature_names=X_train.columns,class_names = ["NO", "YES"])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
print(f"F1 score of gini_model classifier is {f1_score(gini_model.predict(X_test),y_test)}")
print(f"AUC of gini_model classifier is {roc_auc_score(gini_model.predict(X_test),y_test)}")
print('Precision score for "Yes"' , metrics.precision_score(y_test,y_pred2, pos_label =1))
print('Precision score for "No"' , metrics.precision_score(y_test,y_pred2, pos_label =0))
print('Recall score for "Yes"' , metrics.recall_score(y_test,y_pred2, pos_label =1))
print('Recall score for "No"' , metrics.recall_score(y_test,y_pred2, pos_label =0))
cf_matrix=confusion_matrix(y_test,y_pred2)
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')
gini_model2 = tree.DecisionTreeClassifier(criterion="gini", random_state= 34)
gini_model2.fit(X_train, y_train)
y_pred3 = gini_model2.predict(X_test)
y_pred3 = pd.Series(y_pred3)
gini_model2
dot_data = StringIO()
tree.export_graphviz(gini_model2 , out_file=dot_data,
filled=True, rounded=True,
special_characters=True, feature_names=X_train.columns,class_names = ["NO", "YES"])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
print(f"F1 score of gini_model2 classifier is {f1_score(gini_model2.predict(X_test),y_test)}")
print(f"AUC of gini_model2 classifier is {roc_auc_score(gini_model2.predict(X_test),y_test)}")
print('Precision score for "Yes"' , metrics.precision_score(y_test,y_pred3, pos_label =1))
print('Precision score for "No"' , metrics.precision_score(y_test,y_pred3, pos_label =0))
print('Recall score for "Yes"' , metrics.recall_score(y_test,y_pred3, pos_label =1))
print('Recall score for "No"' , metrics.recall_score(y_test,y_pred3, pos_label =0))
cf_matrix=confusion_matrix(y_test,y_pred3)
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')
entr_model2 = tree.DecisionTreeClassifier(criterion="entropy", max_depth = 3, random_state = 254)
entr_model2.fit(X_train, y_train)
y_pred4 = entr_model2.predict(X_test)
y_pred4 = pd.Series(y_pred4)
entr_model2
import graphviz
dot_data = StringIO()
tree.export_graphviz(entr_model2, out_file=dot_data,
filled=True, rounded=True,
special_characters=True, feature_names=X_train.columns,class_names = ["NO", "YES"])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
print(f"F1 score of entr_model2 classifier is {f1_score(entr_model2.predict(X_test),y_test)}")
print(f"AUC of entr_model2 classifier is {roc_auc_score(entr_model2.predict(X_test),y_test)}")
print('Precision score for "Yes"' , metrics.precision_score(y_test,y_pred4, pos_label =1))
print('Precision score for "No"' , metrics.precision_score(y_test,y_pred4, pos_label =0))
print('Recall score for "Yes"' , metrics.recall_score(y_test,y_pred4, pos_label =1))
print('Recall score for "No"' , metrics.recall_score(y_test,y_pred4, pos_label =0))
cf_matrix=confusion_matrix(y_test,y_pred4)
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')
RFModel = RandomForestClassifier(max_depth= 3, random_state= 42)
RFModel.fit(X_train, y_train)
y_pred5=RFModel.predict(X_test)
y_pred5 = pd.Series(y_pred5)
RFModel
print(f"F1 score of RFModel classifier is {f1_score(RFModel.predict(X_test),y_test)}")
print(f"AUC of RFModel classifier is {roc_auc_score(RFModel.predict(X_test),y_test)}")
print('Precision score for "Yes"' , metrics.precision_score(y_test,y_pred5, pos_label =1))
print('Precision score for "No"' , metrics.precision_score(y_test,y_pred5, pos_label =0))
print('Recall score for "Yes"' , metrics.recall_score(y_test,y_pred5, pos_label =1))
print('Recall score for "No"' , metrics.recall_score(y_test,y_pred5, pos_label =0))
cf_matrix=confusion_matrix(y_test,y_pred5)
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')
f, ax = plt.subplots(figsize=(18,5))
sns.barplot(x=["Logistic Regression","entr_model2- 3 depth","gini_model- 3 depth","RFModel"],
y=[roc_auc_score(y_test, clf.predict(X_test)),
roc_auc_score(y_test, entr_model2.predict(X_test)),
roc_auc_score(y_test, gini_model.predict(X_test)),
roc_auc_score(y_test, RFModel.predict(X_test))])
plt.ylabel("AUC")
plt.xlabel("Models")
plt.show()
Conclusion: as we can see, the Random Forest model performs best out of the four models in terms of F1 score, AUC and confusion matrix. Accuracy cannot be used in this case because of the heavy class imbalance: a model that always predicts "not fraud" would already score near-perfect accuracy while catching no fraud at all.
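The point about accuracy can be demonstrated directly: on heavily imbalanced labels, a naive "always predict not fraud" classifier scores very high accuracy while its F1 score exposes the failure. A sketch on synthetic labels mimicking the isFraud distribution:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic, heavily imbalanced labels: 1 fraud in 1000 transactions.
y_true = np.array([0] * 999 + [1] * 1)
y_naive = np.zeros_like(y_true)  # always predict "not fraud"

acc = accuracy_score(y_true, y_naive)   # misleadingly high
f1 = f1_score(y_true, y_naive, zero_division=0)  # reveals the failure
print("accuracy:", acc)
print("F1:", f1)
```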
df_probs = pd.DataFrame({"Actual":y_test,"Predicted":RFModel.predict_proba(X_test)[:,1],"Amount":X_test.amount})
print("% Frauds captured by 'Amount > 200,000' strategy in number and amount:")
print(len(df_probs[(df_probs.Amount > 200000) & (df_probs.Actual==1)])/len(df_probs[df_probs.Actual==1]))
print(sum(df_probs[(df_probs.Amount > 200000) & (df_probs.Actual==1)]["Amount"])/sum(df_probs[df_probs.Actual==1]["Amount"]))